Abstract

We present a heuristic search algorithm for solving first-order Markov Decision Processes
(FOMDPs). Our approach combines first-order state abstraction that avoids evaluating states
individually, and heuristic search that avoids evaluating all states. Firstly, in contrast to
existing systems, which start with propositionalizing the FOMDP and then perform state
abstraction on its propositionalized version, we apply state abstraction directly on the FOMDP,
avoiding propositionalization. This kind of abstraction is referred to as first-order state
abstraction. Secondly, guided by an admissible heuristic, the search is restricted to those
states that are reachable from the initial state. We demonstrate the usefulness of the above
techniques for solving FOMDPs with a system, referred to as FluCaP (formerly, FCPlanner), that
entered the probabilistic track of the 2004 International Planning Competition (IPC'2004) and
demonstrated an advantage over other planners on the problems represented in first-order terms.

1. Introduction

Markov decision processes (MDPs) have been adopted as a representational and computa- 
tional model for decision-theoretic planning problems in much recent work, e.g., by Barto, 
Bradtke, and Singh (1995). The basic solution techniques for MDPs rely on the dynamic 
programming (DP) principle (Boutilier, Dean, & Hanks, 1999). Unfortunately, classical
dynamic programming algorithms require explicit enumeration of the state space, which grows
exponentially with the number of variables relevant to the planning domain. Therefore,
these algorithms do not scale up to complex AI planning problems.

However, several methods that avoid explicit state enumeration have been developed 
recently. One technique, referred to as state abstraction, exploits the structure of the fac- 
tored MDP representation to solve problems efficiently, circumventing explicit state space 
enumeration (Boutilier et al., 1999). Another technique, referred to as heuristic search, 
restricts the computation to states that are reachable from the initial state, e.g., RTDP 
by Barto et al. (1995), envelope DP by Dean, Kaelbling, Kirman, and Nicholson (1995) and 
LAO* by Feng and Hansen (2002). One existing approach that combines both these tech- 
niques is the symbolic LAO* algorithm by Feng and Hansen (2002) which performs heuristic 
search symbolically for factored MDPs. It exploits state abstraction, i.e., manipulates sets of 
states instead of individual states. More precisely, following the SPUDD approach by Hoey, 
St-Aubin, Hu, and Boutilier (1999), all MDP components, value functions, policies, and 
admissible heuristic functions are compactly represented using algebraic decision diagrams
(ADDs). This allows computations of the LAO* algorithm to be performed efficiently using
ADDs. 

Following ideas of symbolic LAO*, given an initial state, we use an admissible heuristic 
to restrict search only to those states that are reachable from the initial state. Moreover, 
we exploit state abstraction in order to avoid evaluating states individually. Thus, our 
work is very much in the spirit of symbolic LAO* but extends it in an important way. 
Whereas the symbolic LAO* algorithm starts with propositionalization of the FOMDP, 
and only after that performs state abstraction on its propositionalized version by means of 
propositional ADDs, we apply state abstraction directly on the structure of the FOMDP, 
avoiding propositionalization. This kind of abstraction is referred to as first-order state 
abstraction. 

Recently, following work by Boutilier, Reiter, and Price (2001), Hölldobler and Skvortsova
(2004) have developed an algorithm, referred to as first-order value iteration (FOVI), that
exploits first-order state abstraction. The dynamics of an MDP are specified in the
Probabilistic Fluent Calculus established by Hölldobler and Schneeberger (1990), which is a
first-order language for reasoning about states and actions. More precisely, FOVI produces 
a logical representation of value functions and policies by constructing first-order formulae 
that partition the state space into clusters, referred to as abstract states. In effect, the 
algorithm performs value iteration on top of these clusters, obviating the need for explicit 
state enumeration. This allows problems that are represented in first-order terms to be 
solved without requiring explicit state enumeration or propositionalization. 

Indeed, propositionalizing FOMDPs can be very impractical: the number of propositions
grows considerably with the number of domain objects and relations. This has a dramatic
impact on the complexity of the algorithms, which depends directly on the number of
propositions. Finally, systems for solving FOMDPs that rely on propositionalizing states
also propositionalize actions, which is problematic in first-order domains because the
number of ground actions also grows dramatically with domain size.

In this paper, we address these limitations by proposing an approach for solving FOMDPs 
that combines first-order state abstraction and heuristic search in a novel way, exploiting 
the power of logical representations. Our algorithm can be viewed as a first-order gener- 
alization of LAO*, in which our contribution is to show how to perform heuristic search 
for first-order MDPs, circumventing their propositionalization. In fact, we show how to 
improve the performance of symbolic LAO* by providing a compact first-order MDP repre- 
sentation using Probabilistic Fluent Calculus instead of propositional ADDs. Alternatively, 
our approach can be considered as a way to improve the efficiency of the FOVI algorithm 
by using heuristic search together with symbolic dynamic programming. 

2. First-order Representation of MDPs 

Recently, several representations for propositionally-factored MDPs have been proposed, 
including dynamic Bayesian networks by Boutilier et al. (1999) and ADDs by Hoey et al. 
(1999). For instance, the SPUDD algorithm by Hoey et al. (1999) has been used to solve 
MDPs with hundreds of millions of states optimally, producing logical descriptions of value 
functions that involve only hundreds of distinct values. This work demonstrates that large 






FluCaP: a Heuristic Search Planner for First-Order MDPs 



MDPs, described in a logical fashion, can often be solved optimally by exploiting the logical 
structure of the problem. 

Meanwhile, many realistic planning domains are best represented in first-order terms. 
However, most existing implemented solutions for first-order MDPs rely on propositional- 
ization, i.e., eliminate all variables at the outset of a solution attempt by instantiating terms 
with all possible combinations of domain objects. This technique can be very impractical 
because the number of propositions grows dramatically with the number of domain objects 
and relations. 

For example, consider the following goal statement taken from the colored Blocksworld 
scenario, where the blocks, in addition to unique identifiers, are associated with colors. 

$$G = \exists X_0 \ldots X_7.\ \mathit{red}(X_0) \wedge \mathit{green}(X_1) \wedge \mathit{blue}(X_2) \wedge \mathit{red}(X_3) \wedge \mathit{red}(X_4) \wedge \mathit{red}(X_5) \wedge \mathit{green}(X_6) \wedge \mathit{green}(X_7) \wedge \mathit{Tower}(X_0, \ldots, X_7),$$

where $\mathit{Tower}(X_0, \ldots, X_7)$ represents the fact that all eight blocks comprise one
tower. We assume that the number of blocks in the domain and their color distribution agrees
with that in the goal statement, namely there are eight blocks $a, b, \ldots, h$ in the domain,
where four of them are red, three are green and one is blue. Then, the full propositionalization
of the goal statement $G$ results in $4!\,3!\,1! = 144$ different ground towers, because there
are exactly that many ways of arranging four red, three green and one blue block in a tower of
eight blocks with the required color characteristics.

The number of ground combinations, and hence, the complexity of reasoning in a propositional
planner, depends dramatically on the number of blocks and, most importantly, on the number of
colors in the domain. The fewer colors a domain contains, the harder it is to solve by a
propositional planner. For example, a goal statement $G'$ that is the same as $G$ above, except
that all eight blocks are of the same color, results in $8! = 40320$ ground towers when
grounded.
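
The two counts above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical assignment of colors to the blocks a, ..., h (the text fixes only how many blocks have each color); the names `PATTERN`, `COLOUR` and `count_ground_towers` are illustrative and not part of FluCaP.

```python
from itertools import permutations
from math import factorial

# Colour pattern required by G for the tower positions X0..X7.
PATTERN = ["red", "green", "blue", "red", "red", "red", "green", "green"]

# Hypothetical colouring of blocks a..h: four red, three green, one blue.
COLOUR = {"a": "red", "b": "red", "c": "red", "d": "red",
          "e": "green", "f": "green", "g": "green", "h": "blue"}


def count_ground_towers(pattern, colour):
    """Count orderings of all blocks whose colours match the pattern position-wise."""
    blocks = sorted(colour)
    return sum(
        all(colour[block] == wanted for block, wanted in zip(order, pattern))
        for order in permutations(blocks)
    )


closed_form = factorial(4) * factorial(3) * factorial(1)
brute_force = count_ground_towers(PATTERN, COLOUR)
single_colour = factorial(8)  # G': every ordering of the eight blocks is a valid tower

print(closed_form, brute_force, single_colour)
```

The brute-force count enumerates exactly the ground instances that a propositional planner would have to consider for this single goal statement.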

To address these limitations, we propose a concise representation of FOMDPs within the 
Probabilistic Fluent Calculus which is a logical approach to modelling dynamically changing 
systems based on first-order logic. But first, we briefly describe the basics of the theory of 
MDPs. 

2.1 MDPs 

A Markov decision process (MDP) is a tuple $(\mathcal{Z}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \mathcal{C})$,
where $\mathcal{Z}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, and
$\mathcal{P} : \mathcal{Z} \times \mathcal{Z} \times \mathcal{A} \to [0,1]$, written
$\mathcal{P}(z'|z,a)$, specifies transition probabilities. In particular, $\mathcal{P}(z'|z,a)$
denotes the probability of ending up at state $z'$ given that the agent was in state $z$ and
action $a$ was executed. $\mathcal{R} : \mathcal{Z} \to \mathbb{R}$ is a real-valued reward
function associating with each state $z$ its immediate utility $\mathcal{R}(z)$.
$\mathcal{C} : \mathcal{A} \to \mathbb{R}$ is a real-valued cost function associating a cost
$\mathcal{C}(a)$ with each action $a$. A sequential decision problem consists of an MDP and is
the problem of finding a policy $\pi : \mathcal{Z} \to \mathcal{A}$ that maximizes the total
expected discounted reward received when executing the policy $\pi$ over an infinite (or
indefinite) horizon.

The value of state $z$, when starting in $z$ and following the policy $\pi$ afterwards, can be
computed by the following system of linear equations:

$$V_\pi(z) = \mathcal{R}(z) + \mathcal{C}(\pi(z)) + \gamma \sum_{z' \in \mathcal{Z}} \mathcal{P}(z'|z,\pi(z))\, V_\pi(z'),$$

where $0 \le \gamma \le 1$ is a discount factor. We take $\gamma$ equal to 1 for
indefinite-horizon problems only, i.e., when a goal is reached the system enters an absorbing
state in which no further rewards or costs are accrued. The optimal value function $V^*$
satisfies:

$$V^*(z) = \mathcal{R}(z) + \max_{a \in \mathcal{A}} \Big\{ \mathcal{C}(a) + \gamma \sum_{z' \in \mathcal{Z}} \mathcal{P}(z'|z,a)\, V^*(z') \Big\}$$

for each $z \in \mathcal{Z}$.
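
To make the optimality equation concrete, the following is a minimal sketch of classical value iteration on a hypothetical four-state MDP. The state names, probabilities, goal reward and costs are assumptions chosen for illustration only; costs are written as negative numbers because they enter the equation additively, and the values of the absorbing states are held fixed.

```python
GAMMA = 1.0                                   # undiscounted, indefinite horizon
REWARD = {"s0": 0.0, "s1": 0.0, "goal": 10.0, "end": 0.0}
COST = {"move": -1.0, "done": 0.0}            # assumed: costs are negative rewards
TERMINAL = {"goal", "end"}                    # absorbing, nothing accrues afterwards

# Assumed transition model: P[(z, a)] = {z': probability}
P = {
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s0", "done"): {"end": 1.0},
    ("s1", "move"): {"goal": 0.8, "s1": 0.2},
    ("s1", "done"): {"end": 1.0},
}


def q_value(V, z, a):
    """R(z) + C(a) + gamma * sum over z' of P(z'|z,a) * V(z')."""
    return REWARD[z] + COST[a] + GAMMA * sum(p * V[z2] for z2, p in P[(z, a)].items())


def value_iteration(eps=1e-9):
    # Terminal states keep their own reward as value; all other states start at 0.
    V = {z: (REWARD[z] if z in TERMINAL else 0.0) for z in REWARD}
    while True:
        residual = 0.0
        for z in REWARD:
            if z in TERMINAL:
                continue
            best = max(q_value(V, z, a) for a in COST)
            residual = max(residual, abs(best - V[z]))
            V[z] = best
        if residual < eps:
            break
    policy = {z: max(COST, key=lambda a: q_value(V, z, a))
              for z in REWARD if z not in TERMINAL}
    return V, policy


V, policy = value_iteration()
print(V, policy)
```

Every state is enumerated explicitly in this sketch, which is exactly the step that does not scale and that the first-order techniques of this paper avoid.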

For the competition, the expected total reward model was used as the optimality cri- 
terion. Without discounting, some care is required in the design of planning problems to 
ensure that the expected total reward is bounded for the optimal policy. The following 
restrictions were made for problems used in the planning competition: 

1. Each problem had a goal statement, identifying a set of absorbing goal states. 

2. A positive reward was associated with transitioning into a goal state. 

3. A cost was associated with each action. 

4. A "done" action was available in all states, which could be used to end further accu- 
mulation of reward. 

These conditions ensure that an MDP model of a planning problem is a positive bounded 
model described by Puterman (1994). The only positive reward is for transitioning into a 
goal state. Since goal states are absorbing, that is, they have no outgoing transitions, the 
maximum value of any state is bounded by the goal reward. Furthermore, the "done" action 
ensures that there is an action available in each state that guarantees a non-negative future 
reward. 

2.2 Probabilistic Fluent Calculus 

Fluent Calculus (FC) by Hölldobler and Schneeberger (1990) was originally set up as a
first-order logic program with equality using SLDE-resolution as the sole inference rule. 
The Probabilistic Fluent Calculus (PFC) is an extension of the original FC for expressing 
planning domains with actions which have probabilistic effects. 



Formally, let $\Sigma$ denote a set of function symbols. We distinguish two function symbols in
$\Sigma$, namely the binary function symbol $\circ$, which is associative, commutative, and
admits the unit element, and a constant $1$. Let $\Sigma^- = \Sigma \setminus \{\circ, 1\}$.
Non-variable $\Sigma^-$-terms are called fluents. The function names of fluents are referred to
as fluent names. For example, $\mathit{on}(X, \mathit{table})$ is a fluent meaning informally
that some block $X$ is on the table, where $\mathit{on}$ is a fluent name. Fluent terms are
defined inductively as follows: $1$ is a fluent term; each fluent is a fluent term; $F \circ G$
is a fluent term, if $F$ and $G$ are fluent terms. For example,
$\mathit{on}(b, \mathit{table}) \circ \mathit{holding}(X)$ is a fluent term denoting informally
that the block $b$ is on the table and some block $X$ is in the robot's gripper. In other words,
freely occurring variables are assumed to be existentially quantified.




States

We assume that each fluent may occur at most once in a state. Moreover, function symbols,
except for the binary $\circ$ operator, the constant $1$, fluent names and constants, are
disallowed. In addition, the binary function symbol $\circ$ is allowed to appear only as an
outermost connective in a fluent term. We denote a set of fluents as $\mathcal{F}$ and a set of
fluent terms as $\mathcal{L}_{\mathcal{F}}$, respectively. An abstract state is defined by a
pair $(P, \mathcal{N})$, where $P \in \mathcal{L}_{\mathcal{F}}$ and
$\mathcal{N} \subseteq \mathcal{L}_{\mathcal{F}}$. We denote individual states by
$z, z_1, z_2$ etc., abstract states by $Z, Z_1, Z_2$ etc., and a set of abstract states by
$\mathcal{L}_{PN}$.

The interpretation over $\mathcal{F}$, denoted as $\mathcal{I}$, is the pair
$(\Delta, \cdot^{\mathcal{I}})$, where the domain $\Delta$ is a set of all finite sets of ground
fluents from $\mathcal{F}$, and $\cdot^{\mathcal{I}}$ is an interpretation function which
assigns to each fluent term $F$ a set $F^{\mathcal{I}} \subseteq \Delta$ and to each abstract
state $Z = (P, \mathcal{N})$ a set $Z^{\mathcal{I}} \subseteq \Delta$ as follows:

$$F^{\mathcal{I}} = \{ d \in \Delta \mid \exists\theta.\ F\theta \subseteq d \}$$

$$Z^{\mathcal{I}} = \{ d \in \Delta \mid \exists\theta.\ P\theta \subseteq d \wedge \forall N \in \mathcal{N}.\ d \notin (N\theta)^{\mathcal{I}} \},$$

where $\theta$ is a substitution. For example, Figure 1 depicts the interpretation of an
abstract state $Z$

$$Z = (\mathit{on}(X, a) \circ \mathit{on}(a, \mathit{table}),\ \{\mathit{on}(Y, X), \mathit{holding}(X')\})$$

that can be informally read: there exists a block $X$ that is on the block $a$, which is on the
table; there is no block $Y$ that is on $X$; and there is no block $X'$ that the robot holds.
Since $Z^{\mathcal{I}}$ contains all finite sets of ground fluents that satisfy the $P$-part
and do not satisfy any of the elements of the $\mathcal{N}$-part, we subtract all sets of ground
fluents that belong to each $N_i \in \mathcal{N}$ from the set of ground fluents that
correspond to the $P$-part. Thus, the bold area in Figure 1 contains exactly those sets of
ground fluents (or, individual states) that satisfy the $P$-part of $Z$ and none of the
elements of its $\mathcal{N}$-part. For example, an individual state
$z_1 = \{\mathit{on}(b,a), \mathit{on}(a,\mathit{table})\}$ belongs to $Z^{\mathcal{I}}$,
whereas $z_2 = \{\mathit{on}(b,a), \mathit{on}(a,\mathit{table}), \mathit{holding}(c)\}$ does
not. In other words, abstract states are characterized by means of conditions that must hold in
each ground instance thereof and, thus, they represent clusters of individual states. In this
way, abstract states embody a form of state space abstraction. This kind of abstraction is
referred to as first-order state abstraction.
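
The membership test behind the interpretation of an abstract state can be sketched as follows. This is a minimal illustration, not the FluCaP implementation: fluents are encoded as tuples, names starting with an uppercase letter are treated as variables, and the helper names (`match`, `embeddings`, `in_interpretation`) are hypothetical.

```python
def is_var(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[0].isupper()


def match(pattern, ground, theta):
    """Extend theta so that pattern under theta equals the ground fluent, else None."""
    if pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    theta = dict(theta)
    for p, g in zip(pattern[1:], ground[1:]):
        if is_var(p):
            if theta.setdefault(p, g) != g:
                return None
        elif p != g:
            return None
    return theta


def embeddings(fluents, d, theta):
    """Yield every extension of theta that maps all fluents into the ground state d."""
    if not fluents:
        yield theta
        return
    for g in d:
        extended = match(fluents[0], g, theta)
        if extended is not None:
            yield from embeddings(fluents[1:], d, extended)


def in_interpretation(d, Z):
    """d is in Z^I iff some theta embeds P into d and no negative term N*theta embeds into d."""
    P, negatives = Z
    for theta in embeddings(P, d, {}):
        if not any(next(embeddings(N, d, theta), None) is not None for N in negatives):
            return True
    return False


# Z = (on(X, a) o on(a, table), {on(Y, X), holding(X')})
Z = ([("on", "X", "a"), ("on", "a", "table")],
     [[("on", "Y", "X")], [("holding", "X'")]])
z1 = {("on", "b", "a"), ("on", "a", "table")}
z2 = z1 | {("holding", "c")}

print(in_interpretation(z1, Z), in_interpretation(z2, Z))
```

The sketch enumerates ground states only to illustrate the semantics; the algorithms below never need to do so, since they operate on the abstract states themselves.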

Actions 

Actions are first-order terms starting with an action function symbol. For example, the
action of picking up some block $X$ from another block $Y$ might be denoted as
$\mathit{pickup}(X, Y)$. Formally, let $N_a$ denote a set of action names disjoint with the set
of function symbols $\Sigma$. An action space is a tuple
$\mathcal{A} = (A, \mathit{Pre}, \mathit{Eff})$, where $A$ is a set of terms of the form
$a(p_1, \ldots, p_n)$, referred to as actions, with $a \in N_a$ and each $p_i$ being either a
variable or a constant; $\mathit{Pre} : A \to \mathcal{L}_{PN}$ is a precondition of $a$; and
$\mathit{Eff} : A \to \mathcal{L}_{PN}$ is an effect of $a$, where $\mathcal{L}_{PN}$ is the
set of abstract states.

So far, we have described deterministic actions only. But actions in PFC may have
probabilistic effects as well. Similar to the work by Boutilier et al. (2001), we decompose a
stochastic action into deterministic primitives under nature's control, referred to as nature's
choices. We use a relation symbol $\mathit{choice}/2$ to model nature's choice. Consider the
action $\mathit{pickup}(X, Y)$:

$$\mathit{choice}(\mathit{pickup}(X, Y), A) \leftrightarrow (A = \mathit{pickupS}(X, Y) \vee A = \mathit{pickupF}(X, Y)),$$

Figure 1: (a) Interpretation of the fluent term $F = \mathit{on}(X, a) \circ \mathit{on}(a, \mathit{table})$;
(b) bold area is the interpretation of the abstract state
$Z' = (\mathit{on}(X, a) \circ \mathit{on}(a, \mathit{table}), \{\mathit{on}(Y, X)\})$;
(c) bold area is the interpretation of the abstract state
$Z = (\mathit{on}(X, a) \circ \mathit{on}(a, \mathit{table}), \{\mathit{on}(Y, X), \mathit{holding}(X')\})$.

where $\mathit{pickupS}(X, Y)$ and $\mathit{pickupF}(X, Y)$ define two nature's choices for
action $\mathit{pickup}(X, Y)$, viz., that it succeeds or fails. For example, the nature's
choice $\mathit{pickupS}$ can be defined as follows:

$$\mathit{Pre}(\mathit{pickupS}(X, Y)) := (\mathit{on}(X, Y) \circ e,\ \{\mathit{on}(W, X)\})$$

$$\mathit{Eff}(\mathit{pickupS}(X, Y)) := (\mathit{holding}(X),\ \{\mathit{on}(X, Y)\}),$$

where the fluent $e$ denotes the empty robot's gripper. For simplicity, we denote the set of
nature's choices of an action $a$ as $\mathit{Ch}(a) := \{a_j \mid \mathit{choice}(a, a_j)\}$.
Please note that nowhere do these action descriptions restrict the domain of discourse to some
pre-specified set of blocks.

For each of nature's choices $a_j$ associated with an action $a$ we define the probability
$\mathit{prob}(a_j, a, Z)$ denoting the probability with which one of nature's choices $a_j$ is
chosen in a state $Z$. For example,

$$\mathit{prob}(\mathit{pickupS}(X, Y), \mathit{pickup}(X, Y), Z) = 0.75$$

states that the probability for the successful execution of the pickup action in state $Z$ is
0.75.

In the next step, we define the reward function for each state. For example, we might
want to give a reward of 500 to all states in which some block $X$ is on block $a$ and 0,
otherwise:

$$\mathit{reward}(Z) = 500 \iff Z \sqsubseteq (\mathit{on}(X, a), \emptyset)$$

$$\mathit{reward}(Z) = 0 \iff Z \not\sqsubseteq (\mathit{on}(X, a), \emptyset),$$

where $\sqsubseteq$ denotes the subsumption relation, which will be described in detail in
Section 3.2.1. One should observe that we have specified the reward function without explicit
state enumeration. Instead, the state space is divided into two abstract states depending on
whether or not a block $X$ is on block $a$. Likewise, value functions can be specified with
respect to the abstract states only. This is in contrast to classical DP algorithms, in which
the states are explicitly enumerated. Action costs can be analogously defined as follows:

$$\mathit{cost}(\mathit{pickup}(X, Y)) = 3$$

penalizing the execution of the pickup action with the value of 3.

Inference Mechanism 

Herein, we show how to perform inferences, i.e., compute successors of a given abstract state, 
with action schemata directly, avoiding unnecessary grounding. We note that computation 
of predecessors can be performed in a similar way. 

Let $Z = (P, \mathcal{N})$ be an abstract state, $a(p_1, \ldots, p_n)$ be an action with
parameters $p_1, \ldots, p_n$, preconditions $\mathit{Pre}(a) = (P_p, \mathcal{N}_p)$ and
effects $\mathit{Eff}(a) = (P_e, \mathcal{N}_e)$. Let $\theta$ and $\sigma$ be substitutions.
An action $a(p_1, \ldots, p_n)$ is forward applicable, or simply applicable, to $Z$ with
$\theta$ and $\sigma$, denoted as $\mathit{forward}(Z, a, \theta, \sigma)$, if the following
conditions hold:

(f1) $(P_p \circ U_1)\theta =_{AC1} P$

(f2) $\forall N_p \in \mathcal{N}_p.\ \exists N \in \mathcal{N}.\ (P \circ N \circ U_2)\sigma =_{AC1} (P \circ N_p)\theta$,

where $U_1$ and $U_2$ are new AC1-variables and AC1 is the equational theory for $\circ$ that is
represented by the following system of "associativity", "commutativity", and "unit element"
equations:

$$\mathcal{E}_{AC1} = \{\ (\forall X, Y, Z)\ X \circ (Y \circ Z) = (X \circ Y) \circ Z,\quad (\forall X, Y)\ X \circ Y = Y \circ X,\quad (\forall X)\ X \circ 1 = X\ \}.$$

In other words, the conditions (f1) and (f2) guarantee that $Z$ contains both positive and
negative preconditions of the action $a$. If an action $a$ is forward applicable to $Z$ with
$\theta$ and $\sigma$ then $Z_{succ} = (P', \mathcal{N}')$, where

$$P' := (P_e \circ U_1)\theta, \qquad \mathcal{N}' := \mathcal{N}\sigma \setminus \mathcal{N}_p\theta \cup \mathcal{N}_e\theta \qquad (1)$$

is referred to as the $a$-successor of $Z$ with $\theta$ and $\sigma$ and denoted as
$\mathit{succ}(Z, a, \theta, \sigma)$.

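For example, consider the action $\mathit{pickupS}(X, Y)$ as defined above, and take
$Z = (P, \mathcal{N}) = (\mathit{on}(b, \mathit{table}) \circ \mathit{on}(X_1, b) \circ e,\ \{\mathit{on}(X_2, X_1)\})$.
The action $\mathit{pickupS}(X, Y)$ is forward applicable to $Z$ with
$\theta = \{X \mapsto X_1, Y \mapsto b, U_1 \mapsto \mathit{on}(b, \mathit{table})\}$ and
$\sigma = \{X_2 \mapsto W, U_2 \mapsto 1\}$. Thus,
$Z_{succ} = \mathit{succ}(Z, \mathit{pickupS}(X, Y), \theta, \sigma) = (P', \mathcal{N}')$ with

$$P' = \mathit{holding}(X_1) \circ \mathit{on}(b, \mathit{table}), \qquad \mathcal{N}' = \{\mathit{on}(X_1, b)\}.$$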
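
The pickupS example can be replayed mechanically. The following is a minimal sketch that checks the two applicability conditions and computes the successor for substitutions that are given in advance; it does not search for the substitutions, which would require AC1-matching. Fluent terms are encoded as lists of tuples, AC1-variables as plain strings, and all function names are hypothetical.

```python
from collections import Counter


def subst(term, s):
    """Apply substitution s to a fluent term (list of fluents and AC1-variables)."""
    out = []
    for item in term:
        if isinstance(item, str):          # AC1-variable: splice in the bound fluents
            out.extend(s.get(item, [item]))
        else:                              # fluent: substitute its arguments
            out.append((item[0],) + tuple(s.get(arg, arg) for arg in item[1:]))
    return out


def ac1_equal(left, right):
    """Modulo AC1, two fluent terms are equal iff they agree as multisets of fluents."""
    return Counter(left) == Counter(right)


def canon(term):
    return tuple(sorted(term))


def forward(Z, pre, theta, sigma):
    (P, Ns), (Pp, Nps) = Z, pre
    f1 = ac1_equal(subst(Pp + ["U1"], theta), P)
    f2 = all(
        any(ac1_equal(subst(P + N + ["U2"], sigma), subst(P + Np, theta)) for N in Ns)
        for Np in Nps
    )
    return f1 and f2


def succ(Z, pre, eff, theta, sigma):
    if not forward(Z, pre, theta, sigma):
        raise ValueError("action is not forward applicable with these substitutions")
    (_, Ns), (_, Nps), (Pe, Nes) = Z, pre, eff
    P_new = subst(Pe + ["U1"], theta)
    N_sigma = {canon(subst(N, sigma)) for N in Ns}
    Np_theta = {canon(subst(Np, theta)) for Np in Nps}
    Ne_theta = {canon(subst(Ne, theta)) for Ne in Nes}
    return P_new, (N_sigma - Np_theta) | Ne_theta


Z = ([("on", "b", "table"), ("on", "X1", "b"), ("e",)], [[("on", "X2", "X1")]])
PRE = ([("on", "X", "Y"), ("e",)], [[("on", "W", "X")]])
EFF = ([("holding", "X")], [[("on", "X", "Y")]])
THETA = {"X": "X1", "Y": "b", "U1": [("on", "b", "table")]}
SIGMA = {"X2": "W", "U2": []}              # U2 bound to the unit element 1

P_new, N_new = succ(Z, PRE, EFF, THETA, SIGMA)
print(P_new, N_new)
```

The AC1-variable $U_1$ carries the part of the state that the action does not touch, which is why it reappears unchanged in the successor.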

3. First-Order LAO*

We present a generalization of the symbolic LAO* algorithm by Feng and Hansen (2002), 
referred to as first-order LAO* (FOLAO*), for solving FOMDPs. Symbolic LAO* is a 
heuristic search algorithm that exploits state abstraction for solving factored MDPs. Given 
an initial state, symbolic LAO* uses an admissible heuristic to focus computation on the 
parts of the state space that are reachable from the initial state. Moreover, it specifies MDP 
components, value functions, policies, and admissible heuristics using propositional ADDs. 
This allows symbolic LAO* to manipulate sets of states instead of individual states. 

Despite the fact that symbolic LAO* shows an advantageous behaviour in comparison 
to classical non-symbolic LAO* by Hansen and Zilberstein (2001) that evaluates states 
individually, it suffers from an important drawback. While solving FOMDPs, symbolic 
LAO* propositionalizes the problem. This approach is impractical for large FOMDPs. Our 
intention is to show how to improve the performance of symbolic LAO* by providing a 
compact first-order representation of MDPs so that the heuristic search can be performed 
without propositionalization. More precisely, we propose to switch the representational 
formalism for FOMDPs in symbolic LAO* from propositional ADDs to Probabilistic Fluent 
Calculus. The FOLAO* algorithm is presented in Figure 2. 

As symbolic LAO*, FOLAO* has two phases that alternate until a complete solution 
is found, which is guaranteed to be optimal. First, it expands the best partial policy and 
evaluates the states on its fringe using an admissible heuristic function. Then it performs 
dynamic programming on the states visited by the best partial policy, to update their values 
and possibly revise the current best partial policy. We note that we focus on partial policies 
that map a subcollection of states into actions.

policyExpansion(π, S⁰, G)
    E := F := ∅
    from := S⁰
    repeat
        to := ⋃_{Z ∈ from} ⋃_{a_j ∈ Ch(a)} {succ(Z, a_j, θ, σ)},
              where (a, θ, σ) := π(Z)
        F := F ∪ (to − G)
        E := E ∪ from
        from := to ∩ G − E
    until (from = ∅)
    E := E ∪ F
    G := G ∪ F
    return (E, F, G)

FOVI(E, A, prob, reward, cost, γ, V)
    repeat
        V' := V
        loop for each Z ∈ E
            loop for each a ∈ A
                loop for each θ, σ such that forward(Z, a, θ, σ)
                    Q(Z, a, θ, σ) := reward(Z) + cost(a) +
                        γ Σ_{a_j ∈ Ch(a)} prob(a_j, a, Z) · V'(succ(Z, a_j, θ, σ))
                end loop
            end loop
            V(Z) := max_{(a, θ, σ)} Q(Z, a, θ, σ)
        end loop
        V := normalize(V)
        r := ‖V − V'‖
    until stopping criterion
    π := extractPolicy(V)
    return (V, π, r)

FOLAO*(A, prob, reward, cost, γ, S⁰, h, ε)
    V := h
    G := ∅
    For each Z ∈ S⁰, initialize π with an arbitrary action
    repeat
        (E, F, G) := policyExpansion(π, S⁰, G)
        (V, π, r) := FOVI(E, A, prob, reward, cost, γ, V)
    until (F = ∅) and r < ε
    return (π, V)

Figure 2: First-order LAO* algorithm.



In the policy expansion step, we perform reachability analysis to find the set $F$ of states
that have not yet been expanded, but are reachable from the set $S^0$ of initial states by
following the partial policy $\pi$. The set of states $G$ contains states that have been
expanded so far. By expanding a partial policy we mean that it will be defined for a larger set
of states in the dynamic programming step. In symbolic LAO*, reachability analysis on ADDs
is performed by means of the image operator from symbolic model checking, which computes
the set $\mathit{to}$ of successor states following the best current policy. Instead, in
FOLAO*, we apply the $\mathit{succ}$-operator, defined in Equation 1. One should observe that
since the reachability analysis in FOLAO* is performed on abstract states that are defined as
first-order entities, the reasoning about successor states is kept on the first-order level. In
contrast, symbolic LAO* would first instantiate $S^0$ with all possible combinations of
objects, in order to be able to perform computations using propositional ADDs later on.

In contrast to symbolic LAO*, where the dynamic programming step is performed using 
a modified version of SPUDD, we employ a modified first-order value iteration algorithm 
(FOVI). The original FOVI by Hölldobler and Skvortsova (2004) performs value iteration
over the entire state space. We modify it so that it computes on states that are reachable 
from the initial states, more precisely, on the set E of states that are visited by the best cur- 
rent partial policy. In this way, we improve the efficiency of the original FOVI algorithm by 
using reachability analysis together with symbolic dynamic programming. FOVI produces 
a PFC representation of value functions and policies by constructing first-order formulae 
that partition the state space into abstract states. In effect, it performs value iteration on 
top of abstract states, obviating the need for explicit state enumeration. 

Given a FOMDP and a value function represented in PFC, FOVI returns the best partial
value function $V$, the best partial policy $\pi$ and the residual $r$. In order to update the
values of the states $Z$ in $E$, we assign the values from the current value function to the
successors of $Z$. We compute successors with respect to all nature's choices $a_j$. The
residual $r$ is computed as the absolute value of the largest difference between the current
and the newly computed value functions $V'$ and $V$, respectively. We note that the newly
computed value function $V$ is taken in its normalized form, i.e., as a result of the
normalize procedure that will be described in Section 3.2.1. Extraction of a best partial
policy $\pi$ is straightforward: one simply needs to extract the maximizing actions from the
best partial value function $V$.

As with symbolic LAO*, FOLAO* converges to an $\varepsilon$-optimal policy when three
conditions are met: (1) its current policy does not have any unexpanded states, (2) the
residual $r$ is less than the predefined threshold $\varepsilon$, and (3) the value function is
initialized with an admissible heuristic. The original convergence proofs for LAO* and symbolic
LAO* by Hansen and Zilberstein (2001) carry over in a straightforward way to FOLAO*.

When calling FOLAO*, we initialize the value function with an admissible heuristic 
function h that focuses the search on a subset of reachable states. A simple way to create 
an admissible heuristic is to use dynamic programming to compute an approximate value 
function. Therefore, in order to obtain an admissible heuristic h in FOLAO*, we perform 
several iterations of the original FOVI. We start the algorithm on an initial value function 
that is admissible. Since each step of FOVI preserves admissibility, the resulting value 
function is admissible as well. The initial value function assigns the goal reward to each 
state thereby overestimating the optimal value, since the goal reward is the maximal possible 
reward. 
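
The construction of the heuristic can be illustrated on a ground problem. The following is a minimal sketch, assuming a hypothetical two-state MDP with absorbing states `goal` and `end`, negative action costs and a goal reward of 10; it is not the first-order procedure itself, only the propositional analogue of "initialize with the goal reward, then apply a few backups".

```python
GOAL_REWARD = 10.0
COST = {"move": -1.0, "done": 0.0}           # assumed: costs are negative rewards
STATES = ["s0", "s1"]                        # non-absorbing states (reward 0)
TERMINAL_VALUE = {"goal": GOAL_REWARD, "end": 0.0}

# Assumed transition model: P[(z, a)] = {z': probability}
P = {
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s0", "done"): {"end": 1.0},
    ("s1", "move"): {"goal": 0.8, "s1": 0.2},
    ("s1", "done"): {"end": 1.0},
}


def backup(V):
    """One synchronous Bellman backup over all non-absorbing states."""
    new = dict(V)
    for z in STATES:
        new[z] = max(
            COST[a] + sum(p * V[z2] for z2, p in P[(z, a)].items()) for a in COST
        )
    return new


def heuristic(k):
    """Start from the goal reward everywhere (an upper bound), then apply k backups."""
    V = {**TERMINAL_VALUE, **{z: GOAL_REWARD for z in STATES}}
    for _ in range(k):
        V = backup(V)
    return V


v_star = heuristic(200)                      # effectively converged on this tiny model
for k in range(4):
    print(k, {z: round(heuristic(k)[z], 3) for z in STATES})
```

On this model every intermediate value function stays above the converged one, so stopping after any number of backups still yields an overestimate, and each additional backup tightens it.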

Since all computations of FOLAO* are performed on abstract states instead of individual 
states, FOMDPs are solved avoiding explicit state and action enumeration and proposition- 
alization. The first-order reasoning leads to better performance of FOLAO* in comparison 
to symbolic LAO*, as shown in Section 4.

Figure 3: Policy Expansion.



3.1 Policy Expansion 

The policy expansion step in FOLAO* is very similar to the one in the symbolic LAO*
algorithm. Therefore, we illustrate the expansion procedure by means of an example. Assume
that we start from the initial state $Z_0$ and two nondeterministic actions $a^1$ and $a^2$ are
applicable in $Z_0$, each having two outcomes $a^1_1$, $a^1_2$ and $a^2_1$, $a^2_2$,
respectively. Without loss of generality, we assume that the current best policy $\pi$ chooses
$a^1$ as an optimal action at state $Z_0$. We construct the successors $Z_1$ and $Z_2$ of $Z_0$
with respect to both outcomes $a^1_1$ and $a^1_2$ of the action $a^1$.

The fringe set $F$ as well as the set $G$ of states expanded so far contain the states $Z_1$
and $Z_2$ only, whereas the set $E$ of states visited by the best current partial policy gets
the state $Z_0$ in addition. See Figure 3a. In the next step, FOVI is performed on the set $E$.
We assume that the values have been updated in such a way that $a^2$ becomes an optimal action
in $Z_0$. Thus, the successors of $Z_0$ have to be recomputed with respect to the optimal
action $a^2$. See Figure 3b.

One should observe that one of the $a^2$-successors of $Z_0$, namely $Z_2$, is an element of
the set $G$ and thus, it has been contained already in the fringe $F$ during the previous
expansion step. Hence, the state $Z_2$ should be expanded and its value recomputed. This is
shown in Figure 3c, where states $Z_4$ and $Z_5$ are $a^3$-successors of $Z_2$, under the
assumption that $a^3$ is an optimal action in $Z_2$. As a result, the fringe set $F$ contains
the newly discovered states $Z_3$, $Z_4$ and $Z_5$ and we perform FOVI on
$E = \{Z_0, Z_2, Z_3, Z_4, Z_5\}$. The state $Z_1$ is not contained in $E$, because it does not
belong to the best current partial policy, and the dynamic programming step is performed only
on the states that were visited by the best current partial policy.
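
The two expansion steps of this example can be traced with a direct transcription of the policyExpansion procedure from Figure 2. This is a minimal sketch over opaque state names: the successor table `SUCC` is a hypothetical stand-in for the $\mathit{succ}$-operator and already collects the successors under all of nature's choices.

```python
def policy_expansion(policy, succ, S0, G):
    """Follow the partial policy from S0; return visited set E, fringe F and updated G."""
    E, F = set(), set()
    frm = set(S0)
    while frm:
        to = set()
        for Z in frm:
            to |= succ[(Z, policy[Z])]     # successors under all nature's choices
        F |= to - G                        # states never expanded before
        E |= frm
        frm = (to & G) - E                 # previously expanded, not yet visited
    E |= F
    return E, F, G | F


# Assumed successor table for the example: (state, action) -> successor states.
SUCC = {
    ("Z0", "a1"): {"Z1", "Z2"},
    ("Z0", "a2"): {"Z2", "Z3"},
    ("Z2", "a3"): {"Z4", "Z5"},
}

# First expansion: the policy picks a1 in Z0 (Figure 3a).
E1, F1, G1 = policy_expansion({"Z0": "a1"}, SUCC, {"Z0"}, set())

# After dynamic programming the policy picks a2 in Z0 and a3 in Z2 (Figures 3b and 3c).
E2, F2, G2 = policy_expansion({"Z0": "a2", "Z2": "a3"}, SUCC, {"Z0"}, G1)

print(sorted(E1), sorted(F1), sorted(G1))
print(sorted(E2), sorted(F2), sorted(G2))
```

The second call never looks up a policy entry for `Z1`, which reflects that states outside the best current partial policy are ignored in the subsequent dynamic programming step.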

3.2 First-Order Value Iteration 

In FOLAO*, the first-order value iteration algorithm (FOVI) serves two purposes: First, we 
perform several iterations of FOVI in order to create an admissible heuristic h in FOLAO*. 
Second, in the dynamic programming step of FOLAO*, we apply FOVI on the states visited 
by the best partial policy in order to update their values and possibly revise the current 
best partial policy. 

The original FOVI by Hölldobler and Skvortsova (2004) takes a finite state space of
abstract states, a finite set of stochastic actions, real-valued reward and cost functions, and 
an initial value function as input. It produces a first-order representation of the optimal 
value function and policy by exploiting the logical structure of a FOMDP. Thus, FOVI can 
be seen as a first-order counterpart of the classical value iteration algorithm by Bellman 
(1957). 

3.2.1 Normalization 

Following the ideas of Boutilier et al. (2001), FOVI relies on the normalization of the state 
space that represents the value function. By normalization of a state space, we mean an 
equivalence-preserving procedure that reduces the size of a state space. This would have an 
effect only if a state space contains redundant entries, which is usually the case in symbolic 
computations. 

Although normalization is considered to be an important issue, it has been done by 
hand so far. To the best of our knowledge, the preliminary implementation of the ap- 
proach by Boutilier et al. (2001) performs only rudimentary logical simplifications and the 
authors suggest using an automated first-order theorem prover for the normalization task. 
Hölldobler and Skvortsova (2004) have developed an automated normalization procedure
for FOVI that, given a state space, delivers an equivalent one that contains no redundancy. 
The technique employs the notion of a subsumption relation. 

More formally, let $Z_1 = (P_1, \mathcal{N}_1)$ and $Z_2 = (P_2, \mathcal{N}_2)$ be abstract
states. Then $Z_1$ is said to be subsumed by $Z_2$, written $Z_1 \sqsubseteq Z_2$, if and only
if there exist substitutions $\theta$ and $\sigma$ such that the following conditions hold:

(s1) $(P_2 \circ U_1)\theta =_{AC1} P_1$

(s2) $\forall N_2 \in \mathcal{N}_2.\ \exists N_1 \in \mathcal{N}_1.\ (P_1 \circ N_1 \circ U_2)\sigma =_{AC1} (P_1 \circ N_2)\theta$,

where $U_1$ and $U_2$ are new AC1-variables. The motivation for the notion of subsumption
on abstract states is inherited from the notion of $\theta$-subsumption between first-order
clauses by Robinson (1965), with the difference that abstract states contain more complicated
negative parts in contrast to the first-order clauses.

For example, consider two abstract states $Z_1$ and $Z_2$ that are defined as follows:

$$Z_1 = (\mathit{on}(X_1, a) \circ \mathit{on}(a, \mathit{table}),\ \{\mathit{red}(Y_1)\})$$

$$Z_2 = (\mathit{on}(X_2, a),\ \{\mathit{red}(X_2)\}),$$

      Number of states      Time, msec
N     S_update   S_norm     Update   Norm    Runtime, msec   Runtime w/o norm, msec
0            9        6        144      1              145                      144
1           24       14        393      3              396                      593
2           94       23        884     12              896                     2219
3          129       33       1377     16             1393                    13293
4          328       39       2079     46             2125                    77514
5          361       48       2519     51             2570                   805753
6          604       52       3268    107             3375                      n/a
7          627       54       3534    110             3644                      n/a
8          795       56       3873    157             4030                      n/a
9          811       59       4131    154             4285                      n/a

Table 1: Representative timing results for first ten iterations of FOVI.

where Z\ informally asserts that some block X\ is on the block a which is on the table and 
no blocks are red. Whereas Z2 informally states that some block X2 is on the block o and 
X2 is not red. We show that Z\ C Z2. The relation holds since both conditions (si) and 
(s2) are satisfied. Indeed, 

(o?i(X2, a) o U\)d =Aci on(Xi,a) o on{a, table) 

and 

{on{Xi, a) o on{a, table) o red{Yi) o U2)cr = {on{Xi, a) o on{a, table) o red{X2))0 

with 6 = {X2 ^ Xi,Ui on{a, table)} and a = {Yi ^ Xi,U2 ^ 1}. 

One should note that subsumption in the language of abstract states inherits the complexity
bounds of $\theta$-subsumption (Kapur & Narendran, 1986). Namely, deciding subsumption
between two abstract states is NP-complete, in general. However, Karabaev et al.
(2006) have recently developed an efficient algorithm that delivers all solutions of the
subsumption problem for the case where abstract states are fluent terms.
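
For the restricted case just mentioned, where abstract states are plain fluent terms without
negative parts, subsumption reduces to condition (s1): finding every $\theta$ such that
$(P_2 \circ U_1)\theta =_{AC1} P_1$. The following is a naive backtracking sketch of that
matching problem, not the efficient algorithm of Karabaev et al. (2006). The encoding is an
assumption made here for illustration: a fluent term is a list of atoms (read as a multiset),
an atom is a tuple of a predicate name followed by its arguments, strings starting with an
uppercase letter are variables, variables of $P_1$ are treated as constants, and the binding of
$U_1$ is returned as the list of leftover atoms. The names `is_variable`, `match_atom`, and
`all_matches` are hypothetical.

```python
def is_variable(term):
    """Assumed convention: variables start with an uppercase letter."""
    return term[:1].isupper()


def match_atom(pattern, target, theta):
    """One-way match of a pattern atom against a target atom; returns an extended theta or None."""
    if pattern[0] != target[0] or len(pattern) != len(target):
        return None
    extended = dict(theta)
    for p, t in zip(pattern[1:], target[1:]):
        if is_variable(p):
            # Bind p, or check consistency with an earlier binding.
            if extended.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return extended


def all_matches(pattern, target, theta=None, used=frozenset()):
    """Yield every (theta, rest) with pattern*theta plus rest equal to target as multisets."""
    theta = {} if theta is None else theta
    if not pattern:
        # Whatever was not consumed is the binding of the AC1-variable U1.
        yield theta, [atom for i, atom in enumerate(target) if i not in used]
        return
    first, *others = pattern
    for i, atom in enumerate(target):
        if i in used:
            continue
        extended = match_atom(first, atom, theta)
        if extended is not None:
            yield from all_matches(others, target, extended, used | {i})


# Positive parts of Z1 and Z2 from the example above.
P1 = [("on", "X1", "a"), ("on", "a", "table")]
P2 = [("on", "X2", "a")]
solutions = list(all_matches(P2, P1))
```

On the example, the only solution binds `X2` to `X1` and leaves `on(a, table)` for $U_1$, which
agrees with the substitution $\theta$ given above. The search is exponential in the worst case,
in line with the NP-completeness result.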

For the purpose of normalization, it is convenient to represent the value function as a
set of pairs of the form $\langle Z, \alpha \rangle$, where $Z$ is an abstract state and $\alpha$ is a real value. In essence,
the normalization algorithm can be seen as an exhaustive application of the following
simplification rule to the value function $V$:

$$\frac{\langle Z_1, \alpha \rangle \qquad \langle Z_2, \alpha \rangle}{\langle Z_2, \alpha \rangle} \quad Z_1 \sqsubseteq Z_2$$
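
The rule can be applied exhaustively in a single pass over the value function, as the
following sketch illustrates. It is an illustration under stated assumptions rather than the
procedure of Hölldobler and Skvortsova (2004): the subsumption test is passed in as a
parameter `subsumed_by` (a hypothetical name), and the toy usage models an abstract state
as the set of ground states it denotes, with set inclusion standing in for $\sqsubseteq$.

```python
def normalize(value_function, subsumed_by):
    """From pairs (Z1, a) and (Z2, a) with Z1 subsumed by Z2, keep only (Z2, a)."""
    kept = []
    for state, value in value_function:
        # The new pair is redundant if a kept pair with the same value subsumes it.
        if any(v == value and subsumed_by(state, z) for z, v in kept):
            continue
        # Otherwise the new pair may make earlier pairs redundant.
        kept = [(z, v) for z, v in kept
                if not (v == value and subsumed_by(z, state))]
        kept.append((state, value))
    return kept


def subset_subsumption(z1, z2):
    """Toy stand-in for subsumption (an assumption): z1 denotes fewer ground states than z2."""
    return z1 <= z2


V = [
    (frozenset({"s1"}), 10.0),
    (frozenset({"s1", "s2"}), 10.0),  # subsumes the first pair, same value
    (frozenset({"s1"}), 5.0),         # different value, so the rule does not apply
    (frozenset({"s3"}), 10.0),
]
normalized = normalize(V, subset_subsumption)
```

Because every pair is compared against all pairs kept so far, the result contains no two
pairs with equal values whose states are related by the supplied test, so a second pass
changes nothing.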

Table 1 illustrates the importance of the normalization algorithm by providing some repre- 
sentative timing results for the first ten iterations of FOVI. The experiments were carried 
out on the problem taken from the colored Blocksworld scenario consisting of ten blocks. 
Even on such a relatively simple problem FOVI with the normalization switched off does 
not scale beyond the sixth iteration. 

The results in Table 1 demonstrate that the normalization during one iteration of
FOVI dramatically reduces the computational effort during the next iterations. The columns
labelled $S_{update}$ and $S_{norm}$ show the size of the state space after performing the value updates
and the normalization, respectively. For example, the normalization factor, i.e., the ratio
between the number $S_{update}$ of states obtained after performing one update step and the
number $S_{norm}$ of states obtained after performing the normalization step, at the seventh
iteration is 11.6. This means that more than ninety percent of the state space contained
redundant information. The fourth and fifth columns in Table 1 contain the times Update
and Norm spent on performing value updates and on the normalization, respectively. The
total runtime Runtime, when the normalization is switched on, is given in the sixth column.
The seventh column, labelled Runtime w/o norm, gives the total runtime of FOVI when the
normalization is switched off. If we sum up all values in the seventh column and the
values in the sixth column up to and including the sixth iteration, subtract the latter from the
former, and divide the result by the total time Norm needed for performing normalization
during the first six iterations, then we obtain a normalization gain of about three
orders of magnitude.
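
The figures quoted above can be recomputed from Table 1. The snippet below is a sketch:
the rows are copied from Table 1, the type and function names are introduced here, and
"the seventh iteration" is taken to be the row N = 6 on the assumption that iterations are
numbered from zero (the row N = 7 rounds to the same factor).

```python
from collections import namedtuple

Row = namedtuple("Row", "n s_update s_norm update norm runtime runtime_wo_norm")

# Table 1; None stands for the n/a entries.
TABLE_1 = [
    Row(0, 9, 6, 144, 1, 145, 144),
    Row(1, 24, 14, 393, 3, 396, 593),
    Row(2, 94, 23, 884, 12, 896, 2219),
    Row(3, 129, 33, 1377, 16, 1393, 13293),
    Row(4, 328, 39, 2079, 46, 2125, 77514),
    Row(5, 361, 48, 2519, 51, 2570, 805753),
    Row(6, 604, 52, 3268, 107, 3375, None),
    Row(7, 627, 54, 3534, 110, 3644, None),
    Row(8, 795, 56, 3873, 157, 4030, None),
    Row(9, 811, 59, 4131, 154, 4285, None),
]


def normalization_factor(row):
    """Ratio of the state-space size before and after normalization."""
    return row.s_update / row.s_norm


def redundant_share(row):
    """Fraction of abstract states removed by normalization."""
    return 1 - row.s_norm / row.s_update


def normalization_gain(rows):
    """(runtime without norm - runtime with norm) / time spent on normalization,
    over the iterations that finished without normalization."""
    done = [r for r in rows if r.runtime_wo_norm is not None]
    saved = sum(r.runtime_wo_norm for r in done) - sum(r.runtime for r in done)
    return saved / sum(r.norm for r in done)


factor = normalization_factor(TABLE_1[6])   # 604 / 52
share = redundant_share(TABLE_1[6])         # 1 - 52 / 604
gain = normalization_gain(TABLE_1)          # (899516 - 7525) / 129
```

The gain evaluates to roughly 6.9 x 10^3, which is the "about three orders of magnitude"
stated in the text.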

4. Experimental Evaluation 

We demonstrate the advantages of combining the heuristic search together with first-order 
state abstraction on a system, referred to as FluCaP, that has successfully entered the 
probabilistic track of the 2004 International Planning Competition (IPC'2004). The exper- 
imental results were all obtained using RedHat Linux running on a 3.4GHz Pentium IV 
machine with 3GB of RAM. 

In Table 2, we present the performance comparison of FluCaP together with symbolic 
LAO* on examples taken from the colored Blocksworld (BW) scenario that was introduced 
during IPC'2004. 

Our main objective was to investigate whether first-order state abstraction using logic 
could improve the computational behaviour of a planning system for solving FOMDPs. The 
colored BW problems were our main interest since they were the only ones represented in 
first-order terms and hence the only ones that allowed us to make use of the first-order 
state abstraction. Therefore, we have concentrated on the design of a domain-dependent 
planning system that was tuned for the problems taken from the Blocksworld scenario. 

The colored BW problems differ from the classical BW ones in that, along with the 
unique identifier, each block is assigned a specific color. A goal formula, specified in first- 
order terms, provides an arrangement of colors instead of an arrangement of blocks. 

 Problem |     Total av. reward       |     Total time, sec.       | H.time, sec.  |        NAS         | NGS, x10^?    |   %
  B   C  | LAO*  FluCaP FOVI FluCaP^- | LAO*  FluCaP FOVI FluCaP^- | LAO*  FluCaP  | LAO*  FluCaP FOVI  | LAO*  FluCaP  | FluCaP
  5   4  | 494    494   494    494    | 22.3   22.0  23.4   31.1   |  8.7    4.2   |  35    410   1077  | 0.86   0.82   |  2.7
  5   3  | 496    495   495    496    | 23.1   17.8  22.7   25.1   |  9.5    1.3   |  34    172    687  | 0.86   0.68   |  2.1
  5   2  | 496    495   495    495    | 27.3   11.7  ...

Table 2: Performance comparison of FluCaP (denoted as FluCaP) and symbolic LAO*
(denoted as LAO*), where the cells n/a denote the fact that a planner did not
deliver a solution within the time limit of one hour. NAS and NGS are the numbers of
abstract and ground states, respectively.

At the outset of solving a colored BW problem, symbolic LAO* starts by propositionalizing
its components, namely, the goal statement and actions. Only after that, the abstraction
using propositional ADDs is applied. In contrast, FluCaP performs first-order abstraction
on a colored BW problem directly, avoiding unnecessary grounding. In the following,
we show how an abstraction technique affects the computation of a heuristic function. To
create an admissible heuristic, FluCaP performs twenty iterations of FOVI and symbolic
LAO* performs twenty iterations of an approximate value iteration algorithm similar to
APRICODD by St-Aubin, Hoey, and Boutilier (2000). The columns labelled H.time and
NAS show the time needed for computing a heuristic function and the number of abstract
states it covers, respectively. In comparison to FluCaP, symbolic LAO* needs to evaluate
fewer abstract states in the heuristic function but takes considerably more time. One can
conclude that abstract states in symbolic LAO* enjoy more complex structure than those
in FluCaP.
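
The role of a fixed number of value-iteration steps in producing an admissible heuristic
can be illustrated on a ground toy MDP. The sketch below is neither FOVI nor FluCaP's
implementation: it enumerates states explicitly, and the MDP, the discount factor `GAMMA`,
and all function names are assumptions made for illustration. It relies on a standard
property of the Bellman backup: if the starting value function is an upper bound on the
optimal one, every iterate remains an upper bound, so stopping early yields an optimistic
(admissible) heuristic for reward maximization.

```python
def truncated_value_iteration(states, actions, transition, reward, v_init, gamma, iterations):
    """Apply the Bellman backup `iterations` times, starting from v_init."""
    v = dict(v_init)
    for _ in range(iterations):
        v = {
            s: max(
                reward(s, a) + gamma * sum(p * v[t] for t, p in transition(s, a))
                for a in actions(s)
            )
            for s in states
        }
    return v


# Hypothetical toy MDP: walk from "far" via "near" to an absorbing "goal".
STATES = ["far", "near", "goal"]
GAMMA = 0.9


def actions(state):
    return ["move", "stay"]


def transition(state, action):
    """Return a list of (successor, probability) pairs."""
    if state == "goal" or action == "stay":
        return [(state, 1.0)]
    successor = "near" if state == "far" else "goal"
    return [(successor, 0.8), (state, 0.2)]  # a move fails with probability 0.2


def reward(state, action):
    return 10.0 if state == "goal" else -1.0


# No discounted return can exceed max_reward / (1 - gamma), so this start is optimistic.
UPPER_BOUND = {s: 10.0 / (1.0 - GAMMA) for s in STATES}

heuristic = truncated_value_iteration(
    STATES, actions, transition, reward, UPPER_BOUND, GAMMA, iterations=3)
v_star = truncated_value_iteration(
    STATES, actions, transition, reward, UPPER_BOUND, GAMMA, iterations=2000)
```

Only three backups are used here so that the gap between the heuristic and the converged
values stays visible; the planners compared in the text use twenty iterations.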

We note that, in comparison to FOVI, FluCaP restricts the value iteration to a smaller
state space. Intuitively, the value function delivered by FOVI covers a larger
state space because the time that FluCaP allocates to heuristic search is used by FOVI
for performing additional iterations. The results in the column labelled %
show that the harder the problem is (that is, the more colors it contains), the higher the
percentage of runtime spent on normalization. On almost all test problems, normalization
takes about three percent of the total runtime on average.

In order to compare the heuristic accuracy, we present in the column labelled NGS the 
number of ground states which the heuristic assigns non-zero values to. One can see that the 
heuristics returned by FluCaP and symbolic LAO* have similar accuracy, but FluCaP 
takes much less time to compute them. This reflects the advantage of the plain first-order 
abstraction in comparison to the marriage of propositionalization with abstraction using 
propositional ADDs. In some examples, we gain several orders of magnitude in H.time. 

The column labelled Total time presents the time needed to solve a problem. During this
time, a planner must execute 30 runs from an initial state to a goal state. A one-hour block
is allocated for each problem. We note that, in comparison to FluCaP, the time required
by heuristic search in symbolic LAO* (i.e., the difference between Total time and H.time) grows
considerably faster with the size of the problem. This reflects the potential of employing
first-order abstraction instead of abstraction based on propositional ADDs during heuristic
search.

Table 3: Performance of FluCaP on larger instances of one-color Blocksworld problems,
where the cells n/a denote the fact that a planner did not deliver a solution within
the time limit.

The average reward obtained over 30 runs, shown in column Total av. reward, is the 
planner's evaluation score. The reward value close to 500 (which is the maximum possible 
reward) simply indicates that a planner found a reasonably good policy. Each time the 
number of blocks B increases by 1, the running time for symbolic LAO* increases roughly 
10 times. Thus, it could not scale to problems having more than seven blocks. This is 
in contrast to FluCaP which could solve problems of seventeen blocks. We note that 
the number of colors C in a problem affects the efficiency of an abstraction technique. In 
FluCaP, as C decreases, the abstraction rate increases which, in turn, is reflected by the 
dramatic decrease in runtime. The opposite holds for symbolic LAO*. 

In addition, we compare FluCaP with two variants. The first one, denoted as FOVI,
performs no heuristic search at all, but rather employs FOVI to compute the $\varepsilon$-optimal
total value function from which a policy is extracted. The second one, denoted as FluCaP$^-$,
performs 'trivial' heuristic search starting with an initial value function as an admissible
heuristic. As expected, FluCaP, which combines heuristic search and FOVI, demonstrates
an advantage over plain FOVI and trivial heuristic search. These results illustrate the
significance of heuristic search in general (FluCaP vs. FOVI) and the importance of heuristic
accuracy in particular (FluCaP vs. FluCaP$^-$). FOVI and FluCaP$^-$ do not scale to problems
with more than seven blocks.

Table 3 presents the performance results of FluCaP on larger instances of one-color 
BW problems with the number of blocks varying from twenty to thirty-four. We believe that
FluCaP does not scale to problems of larger size because the implementation is not yet 
well optimized. In general, we believe that the FluCaP system should not be as sensitive 
to the size of a problem as propositional planners are. 

Our experiments were targeted at the one-color problems only because they are, on the
one hand, the simplest ones for us and, on the other hand, the bottleneck for propositional
planners. The structure of one-color problems allows us to apply first-order state abstraction
in its full power. For example, for a 34-blocks problem FluCaP operates on about
3.3 thousand abstract states that explode to 9.6 x 10^^ individual states after
propositionalization. A propositional planner must be highly optimized in order to cope with this
non-trivial state space.

                          Total av. reward, ≤500
 Problem      B    (one column per contestant)
 colored           494.6  496.4    n/a    n/a  496.5  496.5  495.8    n/a    n/a
                   486.5  492.8    n/a    n/a  486.6  486.4  487.2    n/a    n/a
             11    479.7  486.3    n/a    n/a  481.3  481.5  481.9    n/a    n/a
 non-colored  5    494.6  494.6  494.8    n/a  494.1  494.6  494.4  494.9  494.1
              8    489.7  489.9    n/a    n/a  488.7  490.3  490    488.8    n/a
             11    479.1    n/a    n/a    n/a  480.3  479.7  481.1  465.7    n/a
             15    467.5    n/a    n/a    n/a  469.4  467.7  486.3  397.2    n/a
             18    351.8    n/a    n/a    n/a  462.4  -54.9    n/a    n/a    n/a
             21    285.7    n/a    n/a    n/a  455.7  455.1  459      n/a    n/a

Table 4: Official competition results for colored and non-colored Blocksworld scenarios.
May, 2004. The n/a-entries in the table indicate that either a planner was not
successful in solving a problem or did not attempt to solve it.

We note that additional colors in larger instances (more than 20 blocks) of BW problems 
cause a dramatic increase in computational time, so we consider these problems as being
unsolved. One should also observe that the number of abstract states NAS increases with 
the number of blocks non-monotonically because the problems are generated randomly. For 
example, the 30-blocks problem happens to be harder than the 34-blocks one. Finally, we 
note that all results that appear in Tables 2 and 3 were obtained by using the new version of 
the evaluation software that does not rely on propositionalization in contrast to the initial 
version that was used during the competition. 

Table 4 presents the competition results from IPC'2004, where FluCaP was competitive 
in comparison with other planners on colored BW problems. FluCaP did not perform 
well on non-colored BW problems because these problems were propositional ones (that 
is, goal statements and initial states are ground) and FluCaP does not yet incorporate 
optimization techniques applied in modern propositional planners. The contestants are 
indicated by their origin. For example, Dresden stands for FluCaP and UMass for symbolic LAO*.
Because only the pickup action has cost 1, the gain of five points in total reward means 
that the plan contains ten fewer actions on average. The competition domains and log files 
are available in an online appendix of Younes, Littman, Weissman, and Asmuth (2005). 
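
One way to spell out the arithmetic behind the remark on pickup costs, under the assumption
(not stated explicitly above) that every pickup is paired with exactly one cost-free action
that puts the block down again, so that pickups make up half of the actions in a plan. The
symbols $\Delta R$, $\Delta n_{pickup}$, and $\Delta n_{actions}$ are introduced here for illustration:

$$\Delta R = 5 \;\Rightarrow\; \Delta n_{pickup} = \frac{\Delta R}{1} = 5 \;\Rightarrow\; \Delta n_{actions} = 2 \cdot \Delta n_{pickup} = 10.$$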

Although the empirical results that are presented in this work were obtained on the
domain-dependent version of FluCaP, we have recently developed (Karabaev et al.,
2006) an efficient domain-independent inference mechanism that is the core of a
domain-independent version of FluCaP.

5. Related Work

We follow the symbolic DP (SDP) approach within Situation Calculus (SC) of Boutilier 
et al. (2001) in using first-order state abstraction for FOMDPs. One difference is in the 
representation language: We use PFC instead of SC. In the course of symbolic value it- 
eration, a state space may contain redundant abstract states that dramatically affect the 
algorithm's efficiency. In order to achieve computational savings, normalization must be per- 
formed to remove this redundancy. However, in the original work by Boutilier et al. (2001) 
this was done by hand. To the best of our knowledge, the preliminary implementation of 
the SDP approach within SC uses human-provided rewrite rules for logical simplification. 
In contrast, Hölldobler and Skvortsova (2004) have developed an automated normalization
procedure for FOVI that is incorporated in the competition version of FluCaP and yields
a computational gain of several orders of magnitude. Another crucial difference is that our
algorithm uses heuristic search to limit the number of states for which a policy is computed. 

The ReBel algorithm by Kersting, van Otterlo, and De Raedt (2004) relates to FOLAO* 
in that it also uses a representation language that is simpler than Situation Calculus. This 
feature makes the state space normalization computationally feasible. 

In motivation, our approach is closely connected to Relational Envelope-based Planning 
(REBP) by Gardiol and Kaelbling (2003) that represents MDP dynamics by a compact set
of relational rules and extends the envelope method by Dean et al. (1995). However, REBP 
propositionalizes actions first, and only afterwards employs abstraction using equivalence- 
class sampling. In contrast, FOLAO* directly applies state and action abstraction on the 
first-order structure of an MDP. In this respect, REBP is closer to symbolic LAO* than to 
FOLAO*. Moreover, in contrast to PFC, action descriptions in REBP do not allow negation 
to appear in preconditions or in effects. In organization, FOLAO*, as symbolic LAO*, is 
similar to real-time DP by Barto et al. (1995) that is an online search algorithm for MDPs. 
In contrast, FOLAO* works offline. 

All the above algorithms can be classified as deductive approaches to solving FOMDPs. 
They can be characterized by the following features: (1) they are model-based, (2) they 
aim at exact solutions, and (3) logical reasoning methods are used to compute abstractions. 
We should note that FOVI aims at an exact solution for a FOMDP, whereas FOLAO*, due
to the heuristic search that avoids evaluating all states, seeks an approximate solution.
Therefore, it would be more appropriate to classify FOLAO* as an approximate deductive 
approach to FOMDPs. 

In another vein, there is some research on developing inductive approaches to solving 
FOMDPs, e.g., by Fern, Yoon, and Givan (2003). The authors propose the approximate 
policy iteration (API) algorithm, where they replace the use of cost-function approximations 
as policy representations in API with direct, compact state-action mappings, and use a 
standard relational learner to learn these mappings. In effect, Fern et al. provide
policy-language biases that enable solution of very large relational MDPs. All inductive approaches
can be characterized by the following features: (1) they are model-free, (2) they aim at 
approximate solutions, and (3) an abstract model is used to generate biased samples from 
the underlying FOMDP and the abstract model is altered based on them. 

A recent approach by Gretton and Thiébaux (2004) proposes an inductive policy construction
algorithm that strikes a middle ground between deductive and inductive techniques.
The idea is to use reasoning, in particular first-order regression, to automatically
generate a hypothesis language, which is then used as input by an inductive solver. The
approach by Gretton and Thiébaux is related to SDP and to our approach in the sense that
a first-order domain specification language as well as logical reasoning are employed.

6. Conclusions 

We have proposed an approach that combines heuristic search and first-order state ab- 
straction for solving FOMDPs more efficiently. Our approach can be seen as two-fold: 
First, we use dynamic programming to compute an approximate value function that serves 
as an admissible heuristic. Then heuristic search is performed to find an exact solution 
for those states that are reachable from the initial state. In both phases, we exploit the 
power of first-order state abstraction in order to avoid evaluating states individually. As 
experimental results show, our approach breaks new ground in exploring the efficiency of 
first-order representations in solving MDPs. In comparison to existing MDP planners that 
must propositionalize the domain, e.g., symbolic LAO*, our solution scales better on larger 
FOMDPs. 

However, there is plenty remaining to be done. For example, we are interested in the 
question of to what extent the optimization techniques applied in modern propositional 
planners can be combined with first-order state abstraction. In future competitions, we 
would like to face problems where the goal and/or initial states are only partially defined 
and where the underlying domain contains infinitely many objects. 

The current version of FOLAO* is targeted at problems that allow for efficient
first-order state abstraction. More precisely, these are the problems that can be polynomially
translated into PFC. For example, in the colored BW domain, existentially-closed
goal descriptions were linearly translated into the equivalent PFC representation, whereas
universally-closed goal descriptions would require full propositionalization. Thus, the current
version of PFC is less first-order expressive than, e.g., Situation Calculus. In the future,
it would be interesting to study the extensions of the PFC language, in particular, to find
the trade-off between PFC's expressive power and the tractability of solution methods
for FOMDPs based on PFC.